Retrieving similar websites and web pages
Abstract
Similar web pages are pages that are about the same topic and of the same type. A soccer club's website is related to all soccer websites, but similar only to the websites of other soccer clubs. The goal of this research is to find an approach that, given a page, finds similar pages based on textual content. The method has two stages: the first finds similar websites, and the second finds similar pages on those websites. Similarity is based on the textual content, and keyword extraction is used to identify the main topics. The tf*idf measure is used to identify the best keywords, and the cosine similarity measure is used to calculate the similarity between two pages. The approach gives satisfactory results for finding related websites and similar pages, but finding similar websites proves to be a difficult task. The conclusion of this research is that it is hard to find similar websites based on textual content alone. Finally, some suggestions are given for further improvements.
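The two measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the particular tf*idf weighting (relative term frequency times log of inverse document frequency), and the function names `tfidf_vectors`, `top_keywords`, and `cosine_similarity` are all assumptions chosen for illustration, and the sample pages are made up.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens, keeping alphabetic words only.
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * idf."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {}
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            vec[term] = (count / len(tokens)) * idf
        vectors.append(vec)
    return vectors


def top_keywords(vec, k=3):
    # Highest-weighted terms first; ties are broken alphabetically.
    return sorted(vec, key=lambda t: (-vec[t], t))[:k]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical page texts, used only to exercise the functions.
pages = [
    "soccer club league match",
    "soccer club stadium match",
    "recipe pasta tomato sauce",
]
vectors = tfidf_vectors(pages)
print(top_keywords(vectors[0], k=1))
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

In this toy collection the two soccer pages share weighted terms and score above zero, while the recipe page shares none and scores zero. A term that appears in every document gets an idf of zero under this weighting, which is one reason tf*idf favors distinctive keywords.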
Similar resources
SIGCSE: U: Focused Retrieval of University Course Descriptions from Highly Variable Sources
Finding topically relevant content from sparse disparate sources on the Web requires robust techniques. A focused web crawler is a type of crawler that attempts to make predictions about page relevance and traverse the web efficiently to retrieve relevant information. In this work, we design and test a novel framework of focused crawling tailored to extracting semantically relevant information ...
Webometrics of Persian Web Pages Related to Nutrition Based on the Silberg Criteria
Background and Aim: Considering the potential damages caused by inaccurate, inadequate and incomplete information published in web pages, the aim of this study was to evaluate Persian-language web pages containing nutritional information, using Silberg criteria. Materials and Methods: Internet pages related to nutrition were found in “peyvandha.ir” and by searching 20 nutrition-related keywo...
Structure-Based Web Pages Clustering
Recognizing similarities among the documents of a set is one of the objectives of retrieving information. The information related to the similarities of web pages can be used to present similar documents to users in order to retrieve considered information. In the present study, a new algorithm has been proposed to cluster web pages based on their structure. The proposed algorithm is based on h...
Conformity of the Metadata Elements of Central Library Websites of Medical Sciences Universities with Dublin Core Metadata Elements
Introduction: Considering the importance of library websites in the establishment of communication and provision of services for their users, it is crucial to include those features in these websites which can lead to increased dynamism and optimal communication. The present study aimed at comparing Metadata elements of Dublin Core with those of the websites of Central Libraries of Medical Univ...
An Improved Approach to perform Crawling and avoid Duplicate Web Pages
When a web search is performed, the results include many duplicate web pages or websites; that is, a number of similar pages can be found on different web servers. We propose a web crawling approach to detect and avoid duplicate or near-duplicate web pages. In this work we present a keyword-prioritization-based approach to identify a web page on the web. As such pages will be id...
Publication date: 2009